Data Science Weather

Abstract

There has been a constant interest in weather and climate among politics as well as in private businesses. We see politicans debate findings on climate change and we see businesses turn to focus on clean or green energy believing it will be more beneficial for the enviroment. This project will look at weather data in various regions in California from January 1st, 2009 through December 31st, 2018. Because California has a multitide of different climates including coastal, inland, desert, and mountain, we decided to look at multiple cities in the different regions to better guage the characteristics of California’s climate. If we choose data from various cities without planning, we might only have data from a certain type of climate (e.g. coastal), which is not representative of California as a whole.

Introduction

Obtaining Data

To get our data we were able to request a downloadable csv file from National Oceanic and Atmospheric Administration (NOAA). The data included hourly information such as HourlyDewPointTemperature, HourlyDryBulbTemperature, HourlyPressureChange, etc., as well as daily and monthly readings. For our project we decided to focus on the daily readings with a main interest in DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, and DailyPrecipitation. However, we will also use DailyMinimumDryBulbTemperature DailyMaximumDryBulbTemperature to construct linear and logistic models.

Once we had our data, we used the read_csv() function which allowed R to read our dates as BLANK. This data is for South Tahoe Village, which I called the data frame STV.

STV <- read_csv("SouthTahoeVillage.csv")
print(ncol(STV))
print(nrow(STV))
head(STV)

To start, we are interested in looking at the vairbales of Date, DailyAverageDryBulbTemperature, DailyAverageRelativeHumidity, and DailyPrecipitation. Therefore, we manipulate our data to read the columns we are interested in. From our code, we see that column 2 corresponds to Date, 20 is DailyAverageDryBulbTemperature, 21 is DailyAverageRelativeHumidity, and 33 is DailyPrecipitation.

STV.2 <- STV[c(2, 20, 21, 33)]
  names(STV.2) <- c("Date", "DailyAverageDryBulbTemperature", "DailyAverageRelativeHumidity", "DailyPrecipitation")

Cleaing up the Data

Following the inspection of our data, we leanrned that because the data contains hourly information, then we see in the Daily columns that 23 out of every 24 rows has an NA. Since we only want that one time that holds the Daily information, we can omit all the NA.

summary(STV.2)

Currently, we see that there are 115574 NA’s in the DailyAverageRelativeHumidity column and 115524 NA’s in the DailyAverageDryBulbTemperature column. It is important to note that the DailyPrecipitation column is being read as a character, we will come back to this later.

To remove the NA’s we can use the na.omit() function. I will call the new data frame that holds our columns of interest as well as being NA free, STV.clean.

STV.clean <- na.omit(STV.2) #remove NA's
rm(STV.2) #remove STV.2 data set because I dont need it anymore
summary(STV.clean)

Now, our data in NA free but we have an issue with DailyPrecipiation, because we cannot do caclulations on character types. By examining our data, we see that there are T’s in the precipitation column. By researching this, we learned that T stands for Trace, which is an amount of rain that is less than (but not equal to) 0.01 inches.

When entering the graphing phase of our report, we came across a few obstacles. In our Daily Precipitation variable, we noticed that ‘T’ and ‘Ts’ characters were appearing in our data set. The ‘T’ and ‘Ts’ is a common term used to describe very little or “Trace” amounts of precipitation. This led us to understand that the data type of our precipitation variable had been defined as Character instead of Integer. To make more accurate plots we needed to change the ‘T’ and ‘Ts’ characters to their corresponding integer values.

This was done using the mutate() function which implements a substring command, creating a new variable to hold the same values as our original, but will be edited to not lose original data. We then used filter() to filter out all of the values that are ‘T’, ‘Ts’, and 0.00s to find the five number summary of precipitation without those values and compare it to the five number summary with those values.

Next, we mutate our new variable to itself, using the as.numeric command to convert characters to integers. Now that we have only integer values, we obtain the five-number summary without the ‘T’ and ‘Ts’ in our data. Repeating this process without removing the ‘T’, ‘Ts’ and 0.00s allows us to obtain a five-number summary with them included too. We find that our five-number summary did not change drastically, however, our averages did increase or decrease slightly. We therefore decided to keep the ‘T’ and ‘Ts’ for better accuracy and set them to 0.005 inches.

This was done in a different file so errors are present (because ontWeather), I have since moved my files elsewhere and lost it. However, the process was strictly used to validate the amount a “Trace” should be. I will call ontWeather to be STV.clean in the mean time.

ontWeather <- STV
meanPrep <- ontWeather %>% 
   mutate(dailyPrecipitation = str_sub(DailyPrecipitation, 1, 4)) %>% 
   filter(dailyPrecipitation != 'T' & dailyPrecipitation !='0.00' &   dailyPrecipitation != 'Ts') %>% 
   mutate(dailyPrecipitation = as.numeric(dailyPrecipitation)) %>% 
   select(DailyPrecipitation, dailyPrecipitation) 
  
min(meanPrep$dailyPrecipitation) 


### Since the min non-zero, non-T(s) value is 0.01, try T=0.005. 
   ### Create the numeric variable (with NAs) 
   ontWeatherTest <- ontWeather %>% 
     mutate(dailyPrecipitation = as.numeric(str_sub(DailyPrecipitation, 1, 4)))  

  ### Check to see where the NAs are 
   table(is.na(ontWeatherTest$dailyPrecipitation), ontWeather$DailyPrecipitation=="T") 


  ### It looks like there were 3 "Ts".  Check to see what the values were. 
ontWeather$DailyPrecipitation[is.na(ontWeatherTest$dailyPrecipitation)]

### Get a summary of the daily precip with the T(s) removed 
   summary(ontWeather$dailyPrecipitation) 


  ### Replace the NAs with 0.005 
   ontWeather$dailyPrecipitation[is.na(ontWeatherTest$dailyPrecipitation)] <- 0.005 

  ### Recompute the summary 
   summary(ontWeatherTest$dailyPrecipitation) 


  ### The 5-number-summary is unchanged and the mean didn't move much 
   ### Check how many values are non-zero 
   table(ontWeatherTest$dailyPrecipitation) 


  ### Replot with the numeric precipitation 
   ontWeatherTest <- ontWeather %>% 
     mutate(dailyAverageDryBulbTemperature = as.numeric(str_sub(DailyAverageDryBulbTemperature, 1, 2)))  

###temperature 
ontWeather <- ontWeather %>% 
     mutate(DailyAverageDryBulbTemperature = as.numeric(str_sub(DailyAverageDryBulbTemperature, 1, 2))) 
  
###precipitation 
ontWeather <- ontWeather %>% 
     mutate(DailyPrecipitation = as.numeric(str_sub(DailyPrecipitation, 1, 4))) 

 ontWeather$DailyPrecipitation[is.na(ontWeather$DailyPrecipitation)] <- 0.005 

 
STV.clean$DailyPrecipitation <- as.numeric(as.character(STV.clean$DailyPrecipitation)) #convert to numeric 


STV.clean[is.na(STV.clean)] <- 0.005 #make NA's equal to 0 
summary(STV.clean) 

sum(STV.clean$DailyPrecipitation) #in inches

STV.clean$DailyPrecipitation <- as.numeric(as.character(STV.clean$DailyPrecipitation)) #convert to numeric
STV.clean[is.na(STV.clean)] <- 0.005 #make NA's equal to 0
summary(STV.clean)

sum(STV.clean$DailyPrecipitation) #in inches

Our data ranges from Jan 1st, 2009 through Dec 31st, 2018. I want to break the data up into seperate data frames for each year, so we can do individuals plots and calculations.

STV.2009 <- STV.clean[1:365, c(1,2,3,4)]
  names(STV.2009) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2010 <- STV.clean[366:725, c(1,2,3,4)]
  names(STV.2010) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2011 <- STV.clean[726:1088, c(1,2,3,4)]
  names(STV.2011) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2012 <- STV.clean[1091:1454, c(1,2,3,4)]
  names(STV.2012) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2013 <- STV.clean[1455:1788, c(1,2,3,4)]
  names(STV.2013) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2014 <- STV.clean[1789:2149, c(1,2,3,4)]
  names(STV.2014) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2015 <- STV.clean[2150:2507, c(1,2,3,4)]
  names(STV.2015) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2016 <- STV.clean[2508:2872, c(1,2,3,4)]
  names(STV.2016) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2017 <- STV.clean[2873:3237, c(1,2,3,4)]
  names(STV.2017) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

STV.2018 <- STV.clean[3237:3570, c(1,2,3,4)] #no data in month of dec 2018
  names(STV.2018) <- c("Date", "DailyAverageDryBulbTemperature",  "DailyAverageRelativeHumidity", "DailyPrecipitation")

After we broke up the years into their own data frames, we expect all the frames to either have 365 or 366 observations depending if it is a leap year. However, we see that some of our observations in the data frames fall short suggesting we have missing dates. By researching the missing dates, we find that government shutdowns prevent data acquisition.

Interpreting the Data

First, we wanted to visualize our data to check if there are any obvious trends. I first plotted the Date versus DailyAverageRelativeHumidity with the points being characterized by the DailyAverageDryBulbTemperature. As we see in our plot South Tahoe Village 2009: Relative Humidity v. Date, the humidty starts off relatively high and drops down during the middle of the year before increasing again. We can plot our remaining years and notice that there is a seasonal trend: it is more humid in the fall and winter months, and drier in the summer months. Furthermore, we notice that the warmer the temperature, the less humid it is. We constructing the graph I had to use scale_color_distiller() when applying the color scale because I needed a continuous scale. If I used a non-continuous scale (discrete scale) it would cause the legened to give a color to every individual temperature, which is impractical because temperature is continuous.

p2009 <- ggplot(STV.2009, aes(Date, DailyAverageRelativeHumidity)) +
  geom_point(aes(color = DailyAverageDryBulbTemperature)) +
  scale_color_distiller(palette = "Spectral") +
  geom_smooth(linetype = 1, alpha = .4  , colour="purple", method="loess", 
se=F, fullrange=T, size=1) +
ggtitle("South Tahoe Village 2009: Relative Humidity v. Date")
p2009

Because we have so many years and cities, it is inefficient to have 10 individual plots so I attempted to overlay them into one visual. To overlay them I needed to transform the data frame, where instead of reading the dates with months and years, I only read the day of the year. By doing this, I can plot each year’s DailyAverageRelativeHumidity, with respect to the number of the day in the year. Similarily to South Tahoe Village 2009: Relative Humidity v. Date, I want to apply a color scale to characterize the years. I used a discrete scale because my years are a discrete variable. An issue I ran into when graphing the overlay plot is not having enough colors for the color scale. I first used the R color palletes such as RuPu, but it did not have enough colors for all 10 years, so I went a installed the package viridis. By installing the package I was able to use the discrete color scale called scale_color_viridis_d() which had enough colors for the years.

Overlay.humid <- transform( STV.clean,               
               Year = format(Date, '%Y'),
               DayOfYear = as.numeric(format(Date, '%j')),
               Humidity = c(DailyAverageRelativeHumidity),NA)

ggplot(Overlay.humid, aes(DayOfYear, Humidity)) + 
  geom_smooth( aes(colour = Year), method = 'loess', se = F) +
  ggtitle("South Tahoe Village: DailyAverageRelativeHumidty v. Day of the Year, 2009 - 2018")

  scale_color_viridis_d()

Based on DailyAverageRelativeHumidty v. Day of the Year: 2009 through 2018 we see that there is a common trend per year, where the humidity is higher in the fall, winter months and lower in the summer months. We can also plot DailyDryBulbTemperature versus Day of the Year with similar techniques. The generic plot for South Tahoe Tahoe 2009: DailyDryBulbTemperature versus Date corresponds to South Tahoe Village 2009: Relative Humidity v. Date where we also see that humidty is lower for higher temperatures.

t2009 <- ggplot(STV.2009, aes(Date, DailyAverageDryBulbTemperature)) +
  geom_point(aes(color = DailyAverageRelativeHumidity)) +
  scale_color_distiller(palette = "Spectral") +
  geom_smooth(linetype = 1, alpha = .4  , colour="purple", method="loess", 
se=F, fullrange=T, size=1) +
ggtitle("South Tahoe Village 2009: Daily Average Temp v. Date")
t2009

Overlay.temps <- transform( STV.clean,               
               Year = format(Date, '%Y'),
               DayOfYear = as.numeric(format(Date, '%j')),
               Temperature = c(DailyAverageDryBulbTemperature),NA)

ggplot(Overlay.temps, aes(DayOfYear, Temperature)) + 
  geom_smooth( aes(colour = Year), method = 'loess', se = F) +
  ggtitle("South Tahoe Village: DailyAverageDryBulbTemperature v. Day of the Year, 2009 - 2018") +
  scale_color_viridis_d()

Logistic Regression Model

We want to see if we can use the variables DailyMinimumDryBulbTemperature and DailyMaximumDryBulbTemperature to predict whether or not it will rain. I will first use a logistic regression model (lrm) to look at the probability of rainfall based on my two variables. I made a new data frame with our variables of interest.

##need to run so you can get the TRUE and FALSE column
STV.temps <- STV[c(2, 21 ,29 ,30, 33)]
names(STV.temps) <- c("Date","DailyAverageRelativeHumidity", "DailyMaximumDryBulbTemperature", "DailyMinimumDryBulbTemperature", "DailyPrecipitation")
STV.temps.clean <- na.omit(STV.temps) #remove NA's
rm(STV.temps) #remove STV.2 data set because I dont need it anymore
STV.temps.clean$DailyPrecipitation <- as.numeric(as.character(STV.temps.clean$DailyPrecipitation)) #convert to numeric
STV.temps.clean[is.na(STV.temps.clean)] <- 0.005 #make NA's equal to 0
summary(STV.temps.clean)

Since I am planning on doing a logistic model which deals with binary data, I need to make a column of data that has TRUE FALSE information if it rained or not. We do this by using the mutate() function to make a Rain column with the value TRUE or FALSE depending if the amount of percipitation is greater than 0 or equal to 0, respectively.

STV.temps.2 <- STV.temps.clean %>%
  mutate(Rain = ifelse(DailyPrecipitation > 0, TRUE, FALSE))
  head(STV.temps.2)
summary(STV.temps.2)

## load a few packages
  p_load(Hmisc)
  p_load(xtable)
  p_load(lattice)
  p_load(rms)  ### Modern R replacement for Design package

Once the neccesary packages are installed, it is time to construct the logistic model. I will be using the variables DailyAverageRelativeHumidity, DailyMaximumDryBulbTemperature, DailyMinimumDryBulbTemperature, and the interaction effect between the temperatures.

  dd = datadist(STV.temps.2)
  options(datadist="dd")
  attach(STV.temps.2)
  #Rain.lrm.temp = lrm(Rain ~ DailyAverageRelativeHumidity + DailyMaximumDryBulbTemperature * DailyMinimumDryBulbTemperature, x=TRUE, y=TRUE)
  #Rain.lrm.temp
  #anova(Rain.lrm.temp)

Looking at our output of the lrm model, we examine the coefficients to see the effects of the variables. We see that the probability of rain increases for both DailyMinimumDryBulbTemperature as well as DailyAverageRelativeHumidity. There is also a decrease in the probability of precipitation as DailyMaximumDryBulbTemperature increases. We can now plot the probability of the rain, which supports the coefficients from the output of our model.

rain.lrm.predict <- Predict(Rain.lrm.temp, fun=plogis)

## Error in Predict(Rain.lrm.temp, fun = plogis): object 'Rain.lrm.temp' not found

yhat <- rain.lrm.predict$yhat

## Error in eval(expr, envir, enclos): object 'rain.lrm.predict' not found

lb <- rain.lrm.predict$lower

## Error in eval(expr, envir, enclos): object 'rain.lrm.predict' not found

up <- rain.lrm.predict$upper

## Error in eval(expr, envir, enclos): object 'rain.lrm.predict' not found

plot(rain.lrm.predict, col = "grey", ylab="Probability of Rain")

## Error in plot(rain.lrm.predict, col = "grey", ylab = "Probability of Rain"): object 'rain.lrm.predict' not found

Generalized Linear Model

#Subset to the model variables
#create new data frame

df <- data.frame(minTemp = STV.temps.clean$DailyMinimumDryBulbTemperature, 
                 maxTemp = STV.temps.clean$DailyMaximumDryBulbTemperature,
                 obsPrecip = STV.temps.clean$DailyPrecipitation)

#Fit obs data using glm with interaction

Precip.glm <- glm(obsPrecip ~ minTemp * maxTemp, data = df)
summary(Precip.glm)
anova(Precip.glm)
#check residuals
plot(Precip.glm)

#create a grid of temps, where we need to make predictions
rowby <- 1
colby <- 1

#x is min
x <- seq(
  min(df$minTemp),
  max(df$minTemp),
  by = colby)

#y is max
y <- seq(
  min(df$maxTemp),
  max(df$maxTemp),
  by = rowby)

temps <- expand.grid(x, y)
#need to make the variable names match the variable names from the modeled data
names(temps) <- c("minTemp", "maxTemp")

#predict at the grid points with the prediction model from the glm, previosly
precip.hat <- predict(Precip.glm, newdata = temps)

# add new predicted column to temps
temps <- as.matrix(cbind(temps, precip.hat), mode = "numeric")
head(temps)

Plotly with Precip

Plot the residuals and look at the predicted amount of rainfal for temperature combinations. Negative rainfall occurs for impossible weather combinations.

#plot the predicted value surface

p_load(plotly) #load plotly

#create limits for the axes



#plot the `temps` matrix where the variables are x= minTemp and y = maxTemp. The z is the yhat. 

test_x <- x    #sort(runif(10,0,1))
test_y <- y    #sort(test_x+runif(10,0,1))
#test_z <- runif(length(test_x)*length(test_y),0,1)
test_z_matrix <- matrix(temps[,3],nrow=length(test_x),ncol=length(test_y))
#matrix(test_z,nrow=length(test_x),ncol=length(test_y))
#for(idx1 in 1:length(test_x)){
#  for(idx2 in 1:length(test_y)){
#    test_z_matrix[idx1,idx2] = .1*test_z_matrix[idx1,idx2]+test_x[idx1]+test_y[idx2]
#  }
#}

#g <- plot_ly(x = ~x, y = ~y, z = ~temps) %>% add_surface(


g <- plot_ly(x = ~test_x, y = ~test_y, z = ~test_z_matrix) %>% add_surface(
  countours = list(
    z = list(
      show = TRUE,
      usecolormap = TRUE,
      highlightcolor = "#ff0000",
      project = list(z=TRUE)
    )
  )
) %>%
  layout(
    scene = list(
      camera=list(
        eye = list(x=1.87, y=0.88, z=-0.64)
    )
  )
)
g